Abstract

We consider the dictionary problem in external memory and improve the update time of the well-known
buffer tree by roughly a logarithmic factor. For any $\lambda \ge \max\{\lg\lg n, \log_{M/B}(n/B)\}$, we can
support updates in time $O(\lambda/B)$ and queries in sublogarithmic time, $O(\log_\lambda n)$. We also present a lower
bound in the cell-probe model showing that our data structure is optimal.

In the RAM, hash tables have been used to solve the dictionary problem faster than binary search 
for more than half a century. By contrast, our data structure is the first to beat the comparison barrier 
in external memory. Ours is also the first data structure to depart convincingly from the indivisibility 
paradigm. 



1 Introduction 



The case for buffer trees. The dictionary problem asks to maintain a set S of up to n keys from the 
universe U, under insertions, deletions, and (exact) membership queries. The keys may also have associ- 
ated data (given at insert time), which the queries must retrieve. Many types of hash tables can solve the 
dictionary problem with constant time per operation, either in expectation or with high probability. These 
solutions assume a Random Access Machine (RAM) with words of $O(\lg |U|)$ bits, which are sufficient to
store keys and pointers. 

In today's computation environment, the external memory model has become an important alternative to 
the RAM. In this model, it is assumed that there is a memory which is partitioned into pages of B words.
Accessing each page in memory takes unit time. The processor is also equipped with a cache of M words
(M/B pages), which is free to access. The model can be applied at various levels, depending on the size of
the problem at hand. For instance, it can model the interface between disk and main memory, or between 
main memory and the CPU's cache. 

Hash tables benefit only marginally from the external memory model: in simple hash tables like chaining,
the expected time per operation can be decreased to $1 + 2^{-\Omega(B)}$ [Knu73]. Note that as long as M is
smaller than n, the query time cannot go significantly below 1. However, the power of external memory lies 
in the paradigm of buffering, which permits significantly faster updates. In the most extreme case, if a data 
structure simply wants to record a fast stream of updates without worrying about queries, it can do so with 
an amortized complexity of $O(1/B) \ll 1$ per insertion: accumulate B data items in cache, and write them
out at once into a page. 
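
The buffering argument can be illustrated with a minimal sketch. The class name `BufferedLog` and its I/O counter are assumptions made for this illustration only; the sketch simply counts one unit-cost page write for every B inserted items.

```python
class BufferedLog:
    """Append-only log that performs one page write per B buffered items."""

    def __init__(self, page_size):
        self.page_size = page_size  # B, the number of items per page
        self.cache = []             # items held in cache (free to access)
        self.disk = []              # full pages written so far
        self.io_count = 0           # number of unit-cost page writes

    def insert(self, item):
        self.cache.append(item)
        if len(self.cache) == self.page_size:
            self.disk.append(self.cache)  # write the whole page at once
            self.cache = []
            self.io_count += 1


log = BufferedLog(page_size=64)
for key in range(6400):
    log.insert(key)
amortized = log.io_count / 6400  # page writes per insertion, i.e. 1/B
```

With B = 64, the 6400 insertions above cost 100 page writes in total, so the amortized cost per insertion is exactly 1/B.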

Buffer trees, introduced by Arge [Arg03], are one of the pillars of external memory data structures, along
with B-trees. Buffer trees allow insertions at a rate close to the ideal 1/B, while maintaining reasonably
efficient (but superconstant) queries. For instance, they allow update time $t_u = O(\frac{\lg n}{B})$ and query time
$t_q = O(\lg n)$. More generally, they allow the following range of trade-offs:



Theorem 1 (Buffer trees [Arg03]). Buffer trees support updates and queries with the following tradeoffs:

$t_u = O(\frac{\lambda}{B} \lg n)$, $t_q = O(\log_\lambda n)$, for $2 \le \lambda \le B$    (1)

$t_u = O(\frac{1}{B} \log_\lambda n)$, $t_q = O(\lambda \lg n)$, for $2 \le \lambda \le \frac{M}{B}$    (2)

In these bounds and the rest of the paper, we make the following reasonable and common assumptions
about the parameters: $B \ge \lg n$; $M \ge B^{1+\varepsilon}$ (tall cache assumption); $n \ge M^{1+\varepsilon}$.

The motivation for fast (subconstant) updates in the external memory model is quite strong. In applica- 
tions where massive streams of data arrive at a fast rate, the algorithm may need to operate close to the disk 
transfer rate in order to keep up. On the other hand, we want reasonably efficient data structuring to later 
find the proverbial needle in the haystack. 

A second motivation comes from the use of data structures in algorithms. Sorting in external memory
takes $O(\frac{n}{B} \log_M n)$ time, which is significantly sublinear for typical values of the parameters. Achieving a bound
close to this becomes the holy grail for many algorithmic problems. Towards such an end, a data structure 
that spends constant time per operation is of little relevance. 

Finally, fast updates can translate into fast queries in realistic database scenarios. Database records 
typically contain many fields, a subset of which are relevant in typical queries. Thus, we would like to
maintain various indexes to help with different query patterns. If updates are efficient, we can maintain more 
indexes, so it is more likely that we can find a selective index for a future query. (We note that this idea is 
precisely the premise of the start-up company Tokutek, founded by Michael Bender, Martin Farach-Colton
and Bradley Kuszmaul. By using buffer-tree technology to support faster insertions in database indexes, the
company is reporting¹ significant improvements in industrial database applications.)

Figure 1: Selected update/query tradeoffs for buffer trees and our structure.

The comparison/indivisibility barrier. In internal memory, hash tables can be traced back at least to 
1953 [Knu73]. By contrast, in external memory the state-of-the-art data structures (buffer trees, B-trees,
many others built upon them) are all comparison based! 

The reason for using comparison-based data structures in external memory seems more profound than in 
internal memory. Consider the simple task of arranging n keys in a desired order (permuting data). The best 
known algorithm takes time $O(\min\{n, \frac{n}{B} \log_M n\})$: either implement the permutation ignoring the paging,
or use external memory sorting. Furthermore, it has been known since the seminal paper of Aggarwal and
Vitter [AV88] that if the algorithm manipulates data items as indivisible atoms, this bound is tight. It is
often conjectured that this lower bound holds for any algorithm, not just those in the indivisible model. This
would imply that, whenever B is large enough for the internal-memory $O(n)$ solution to become irrelevant,
a task as simple as permuting becomes as hard as comparison-based sort. 

While external-memory data structures do not need to be comparison based, they naturally manipulate 
keys as indivisible objects. This invariably leads to a comparison algorithm: the branching factors that the 
data structure can achieve are related to the number of items in a page, and such branching factors can 
be achieved even by simple comparison-based algorithms. For problems such as dictionary, predecessor 
search, or range reporting, the best known bounds are the bounds of (comparison-based) B-trees or buffer 
trees, whenever B is large enough (and the external memory solution overtakes the RAM-based solution). It 
is plausible to conjecture that this is an inherent limitation of external memory data structures (in fact, such 
a conjecture was put forth by [Yi10]).

Our work presents the first powerful use of hashing to solve the external memory dictionary problem, 
and the first data structure to depart significantly from the indivisibility paradigm. We obtain: 

Theorem 2. For any $\max\{\lg\lg n, \log_M n\} \le \lambda \le B$, we can solve the dictionary problem by a Las Vegas
data structure with update time $t_u = O(\frac{\lambda}{B})$ and query time $t_q = O(\log_\lambda n)$ with high probability.

At the high end of the trade-off, for $\lambda = B^{\varepsilon}$, we obtain update time $O(1/B^{1-\varepsilon})$ and query time
$O(\log_B n)$ (see Figure 1). This is the same as standard buffer trees. Things are more interesting at the
low end (fast updates), which is the raison d'être of buffer trees. Comparing to Theorem 1 (1), our results
are a logarithmic improvement over buffer trees, which could achieve $t_u = O(\frac{\lambda}{B} \lg n)$ and $t_q = O(\log_\lambda n)$.

Interestingly, the update time can be pushed very close to the ideal disk transfer rate of 1/B: we can
obtain $t_u^{\min} = O(\frac{1}{B} \cdot \max\{\log_M n, \lg\lg n\})$.
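
To make the comparison concrete, the following sketch evaluates trade-off (1) of Theorem 1 and the bounds of Theorem 2 with every hidden constant set to 1. The function names and the sample parameters are assumptions chosen for illustration; the numbers indicate orders of magnitude only.

```python
import math


def buffer_tree_tradeoff(n, B, lam):
    """Trade-off (1) of Theorem 1, with all O(.) constants set to 1."""
    return lam / B * math.log2(n), math.log(n, lam)


def hashed_tradeoff(n, B, M, lam):
    """Bounds of Theorem 2, with all O(.) constants set to 1."""
    lower = max(math.log2(math.log2(n)), math.log(n, M))
    if not (lower <= lam <= B):
        raise ValueError("lambda is outside the range allowed by Theorem 2")
    return lam / B, math.log(n, lam)


# Hypothetical parameters: n = 2^40 keys, pages of B = 2^10 words, M = 2^20.
n, B, M, lam = 2**40, 2**10, 2**20, 8
old_tu, old_tq = buffer_tree_tradeoff(n, B, lam)
new_tu, new_tq = hashed_tradeoff(n, B, M, lam)
speedup = old_tu / new_tu  # equals lg n when the constants are ignored
```

For these parameters the query bounds coincide, while the update bound improves by a factor of lg n = 40.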

¹ http://tokutek.com/

Note that it is quite natural to restrict the update time to $\Omega(\frac{1}{B} \log_M n)$. Unless one can break the
permutation bound, this is an inherent limitation of any data structure that has some target order in which keys
should settle after a long enough presence in the data structure (be it the sorted order, or a hash-based
order). Since buffer trees work in the indivisible model, they share this limitation. However, the buffer tree
pays a significant penalty in query time to achieve $t_u = O(\frac{1}{B} \log_M n)$: from Theorem 1 (2), this requires
$\lambda = M^{\Omega(1)}$ and hence $t_q \ge M^{\Omega(1)} \lg n$, which is significantly more than polylogarithmic (for interesting
ranges of M). By contrast, our bound on the query time is still (slightly) sublogarithmic.

If one assumes M is fairly large (such as $n^{\varepsilon}$), then $\lambda \ge \lg\lg n$ becomes the bottleneck. This is an
inherent limitation of our new data structure. In this case, we can achieve $t_u = O(\frac{\lg\lg n}{B})$ and
$t_q = O(\lg n / \lg\lg\lg n)$. By contrast, buffer trees naturally achieve $t_u = O(\frac{\lg n}{B})$ and $t_q = O(\lg n)$. With a
comparable query time, our data structure gets exponentially closer to the disk transfer rate for updates. If
we ask for $t_u = O(\frac{\lg\lg n}{B})$ in buffer trees, then, from Theorem 1 (2), we must have a huge query time of
$t_q = 2^{\Omega(\lg n / \lg\lg n)}$.

Our result suggests exciting possibilities in external memory data structures. It is conceivable that, by 
abandoning the comparison and indivisibility paradigms, long-standing running times of natural problems 
such as predecessor search or range reporting can also be improved. 

Lower bounds. We complement our data structure with a lower bound that shows its optimality:

Theorem 3. Let $\varepsilon > 0$ be an arbitrary constant. Consider a data structure for the membership problem
with at most n keys from the universe [2n], running in the cell-probe model with cells of $O(B \lg n)$ bits and
a state (cache) of M bits. Assume $B \ge \lg n$ and $M \le n^{1-\varepsilon}$. The data structure may be randomized. Let
$t_u$ be the expected amortized update time, and $t_q$ be the query time. The query need only be correct with
probability $1 - C_\varepsilon$, where $C_\varepsilon$ is a constant depending on $\varepsilon$.
If $t_u \le 1 - \varepsilon$, then $t_q = \Omega(\lg n / \lg(B \cdot t_u))$.

Remember that for any desired $t_u \ge t_u^{\min} = \frac{1}{B} \cdot \max\{\log_M n, \lg\lg n\}$, our new data structure obtained
a query time of $t_q = O(\lg n / \lg(B \cdot t_u))$. In other words, we have shown an optimal trade-off for any
$t_u \ge t_u^{\min}$. We conjecture that for $t_u = o(t_u^{\min})$, the query time cannot be polylogarithmic in n.

Our lower bound holds for any reasonable cache size, $M \le n^{1-\varepsilon}$. One may wonder whether a better
bound for smaller M is possible (e.g. proving that for $t_u = o(\frac{1}{B} \log_M n)$, the query time needs to be
super-logarithmic). Unfortunately, proving this may be very difficult. If sorting n keys in external memory were
to take time $O(n/B)$, then our data structure would work for any $t_u = \Omega(\lg\lg n / B)$, regardless of M. Thus, a
better lower bound for small cache size would imply that sorting requires superlinear time (and, in particular, 
a superlinear circuit lower bound, which would be a very significant progress in complexity theory). 

Remember that update time below 1/B is unattainable, regardless of the query time. Thus, the remaining 
gap in understanding membership is in the range $t_u \in [\frac{1}{B}, \frac{\lg\lg n}{B}]$.

At the high end of our trade-off, we see a sharp discontinuity between internal memory solutions (hash
tables with $t_u = t_q \approx 1$) and buffer trees. For any $t_u < 1 - \varepsilon$, the query time blows up to $\Omega(\log_B n)$.

The lower bound works in the strongest possible conditions: it holds even for membership and allows 
Monte Carlo randomization. Note that the error probability can be made an arbitrarily small constant by 
$O(1)$ parallel constructions of the data structure. However, since we want a clean phase transition between
$t_u = 1 - \varepsilon$ and $t_u = 1$, we mandate a fixed constant bound on the error.

The first cell-probe lower bound for external-memory membership was by Yi and Zhang [YZ10] in
SODA'10. Essentially, they show that if $t_u \le 0.9$, then $t_q \ge 1.01$. This bound was significantly strengthened
by Verbin and Zhang [VZ10] in STOC'10. They showed that for any $t_u \le 1 - \varepsilon$, we have $t_q = \Omega(\log_B n)$.
This bound is recovered as the most extreme point on our trade-off curve. However, our proof is
significantly simpler than that of Verbin and Zhang. We also note that the technique of [VZ10] does not yield
better query bounds for fast updates. A lower bound for small $t_u$ is particularly interesting given our new
data structure and the regime in which buffer trees are most appealing.

For sufficiently small update time $t_u$, our lower bound even beats the best known comparison lower bound. This
was shown by Brodal and Fagerberg [BF03] in SODA'03, and states that $t_q = \Omega(\lg n / \log(t_u B \lg n))$. On
the other hand, in the comparison model, it was possible to show [BF03] that one cannot take $t_u \ll \frac{1}{B} \log_M n$
(below the permutation barrier) without a significant penalty in query time: $t_q \ge n^{\Omega(1)}$.

2 Upper Bound 

Our data structure is presented in a number of levels. First, in Section 2.1 we describe how we can map
word-sized keys into keys with $O(\lg n)$ bits; a brief discussion of how deletions can be handled using
insertions also appears in this preliminary high-level section.

In Section 2.2 we then proceed to describe the core component of our structure, called a "gadget."
The gadget is defined recursively, and a description of the recursive implementation of the operations is
presented in Section 2.2.1. The small non-recursive gadget at the base of the recursion is nontrivial enough
to merit a separate description appearing in Section 2.2.2. We present a high-level analysis of the gadget
in Section 2.2.3; some probability arguments about the size of gadgets and key-value collisions are needed,
which are isolated in Section 2.2.4.

Our gadget has certain size limitations, in that it can only be used as presented if it fits into cache. So,
for larger data we use as global recursive structure a variant of buffer trees, where we switch to our gadgets
at levels of the recursion where the size requirements of our gadget are met. We elaborate on this idea
in Section 2.3, which completes the description of our structure.

To summarize: after preprocessing (§2.1), a buffer-tree based structure is used, where the leaves are
gadgets (§2.3); gadgets are defined recursively (§2.2) with a non-trivial base structure (§2.2.2).

2.1 Preliminary key shrinkage and deletions 

As a warm-up, we show how we can assume that keys and associated data have $O(\lg n)$ bits. The data
structure can simply log keys in an array, ordered by insertion time. In addition, it hashes keys to the
universe $[n^{O(1)}]$, and inserts each key into a buffer tree with an index into the time-ordered array (of $O(\lg n)$
bits) as associated data. A buffer-tree query may return several items with the same hash value. Using the
associated pointers, we can inspect the true value of the key in the logging array. Each takes one memory
access, but we know that there are only $O(1)$ false positives with high probability (w.h.p.), with a good
enough hash function.

Deletions can be handled in a black-box fashion: add the key to the logging array with a special asso- 
ciated value that indicates a delete, and insert it into the buffer tree normally. Throughout the paper, we 
will consider buffer trees with an "overwrite" semantics: if the same key is inserted multiple times, with 
different associated values, only the last one is relevant to the query. Thus, the query will return a pointer 
to the deletion entry in the log. After $O(n)$ deletions we perform global rebuilding of the data structure to
keep the array bounded.
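
The reduction above can be sketched as follows. This is only an illustration under stated assumptions: a Python dict stands in for the buffer tree, the hash is a truncated SHA-256 digest, the names `ShrunkDictionary` and `DELETED` are hypothetical, and the global rebuilding after O(n) deletions is omitted.

```python
import hashlib

DELETED = object()  # hypothetical marker value recording a deletion


class ShrunkDictionary:
    """Time-ordered log plus an index keyed by short hash codes."""

    def __init__(self, hash_bits=32):
        self.hash_bits = hash_bits
        self.log = []    # time-ordered array of (key, value)
        self.index = {}  # stand-in for the buffer tree: hash -> log positions

    def _hash(self, key):
        digest = hashlib.sha256(repr(key).encode()).digest()
        return int.from_bytes(digest[:8], "big") % (1 << self.hash_bits)

    def insert(self, key, value):
        self.log.append((key, value))
        self.index.setdefault(self._hash(key), []).append(len(self.log) - 1)

    def delete(self, key):
        # A deletion is an ordinary insertion of a special associated value.
        self.insert(key, DELETED)

    def query(self, key):
        # Newest entry first, which models the "overwrite" semantics.
        for position in reversed(self.index.get(self._hash(key), [])):
            stored_key, value = self.log[position]
            if stored_key == key:  # check the true key; skip false positives
                return None if value is DELETED else value
        return None
```

Every candidate returned by the index is verified against the true key stored in the log, so hash collisions cost extra log accesses but never produce wrong answers.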

From now on, we assume the keys have $O(\lg n)$ bits. Let $b = \Theta(B \lg n)$ be the number of bits in a page.






2.2 Gadgets 



The fundamental building block of our structure is called a gadget. A t-gadget stores a multiset S from
$([b] \times [t] \times [t]) \times [t^{O(1)}]$. We refer to the components of a tuple $x = ((p, d, s), b) \in S$ as:

• the page hash, $p \in [b]$;

• the distribution hash, $d \in [t]$;

• the shadow hash, $s \in [t]$;

• the backpointer, $b \in [t^{O(1)}]$.

Taken together, the page, distribution, and shadow hash codes are treated as
a key, with the backpointer being associated data.

Operations. A t-gadget stores a multiset supporting two operations: 

BULK-INSERT(T): Insert a multiset $T \subset ([b] \times [t]^2) \times [t^{O(1)}]$ into the data structure ($S \leftarrow S \cup T$). The
multiset is presented packed into $O(\lceil |T| / \frac{b}{\lg t} \rceil)$ pages.

QUERY(x): Given a key $x \in [b] \times [t]^2$, return the (possibly empty) list of the backpointer values of all
elements of S with key value x. We aim for time bounds proportional to the number of occurrences
of elements with key value x in the multiset.

Capacity invariants. Our construction will guarantee that the number of keys stored in a t-gadget is
$|S| = O(bt)$ with high probability. A t-gadget will occupy $O(|S| / \frac{b}{\lg t}) = O(t \lg t)$ pages thanks to the use
of a succinct encoding. At the beginning of any operation, there is no guarantee that any part of the t-gadget 
is in cache. However, we will only use gadgets that can completely fit in cache, i.e. the entire gadget can be 
loaded in cache by the update algorithm if desired. 

A recursive construction. At a high level, our data structure follows the same recursive construction used
in the van Emde Boas layout. There are two types of t-gadgets: the recursive t-gadget, and the base t-gadget which is
used as a base case for small t. The description of the base t-gadget is deferred to §2.2.2. Here we define the
recursive t-gadget.

Recursive t-gadgets contain the following components: 

• The log: An array containing all the elements of S in the order of insertion. The last block of the array
is the only one which may be partially full and is referred to as the tail block of the log.

• The top gadget $g^{\top}$: A recursive $\sqrt{t}$-gadget.

• The bottom gadgets $g^{\bot}_i$: An array of $\sqrt{t}$ $\sqrt{t}$-gadgets.

All elements of S are stored in the log; furthermore, all elements of S except for those in the tail block 
will have a truncated representation of their keys recursively stored in either the top gadget or one of the 
bottom gadgets. Formally: 

Invariant 4. Let HIGH(·) and LOW(·) refer to the most and least significant half of the bits of their parameter.
Given an element $x = ((p, d, s), b)$, exactly one of the following holds:

1. Element x is stored in the tail block of the log.

2. Element x is stored in block i of the log and $((p, \mathrm{HIGH}(d), \mathrm{HIGH}(s)), i)$ is stored in the top gadget,
$g^{\top}$. The element $((p, \mathrm{HIGH}(d), \mathrm{HIGH}(s)), i)$ is called the "top-compressed" key.

3. Element x is stored in block i of the log and $((p, \mathrm{LOW}(d), \mathrm{LOW}(s)), i)$ is stored in $g^{\bot}_{\mathrm{HIGH}(d)}$. The
element $((p, \mathrm{LOW}(d), \mathrm{LOW}(s)), i)$ is called the "bottom-compressed" key.

Given an element x stored in a t-gadget, the top-compressed version of x consists of the entire page
hash of x, and the $\frac{1}{2} \lg t$ higher-order bits of the distribution hash and shadow hash. The bottom-compressed
version of x is constructed analogously using the page hash and the lower-order bits of the distribution
hash and shadow hash. In both cases the backpointer indicates the page containing x in the log of the
t-gadget. These compressed elements meet the size requirements for elements that can be bulk-inserted into
a $\sqrt{t}$-gadget; i.e. the compressed key by definition is an element of $[b] \times [\sqrt{t}]^2$. Since $|S| = \Theta(bt)$, there are
at most $\Theta(t \lg t)$ pages in the log, so the backpointer is easily within the required range.

The backpointer serves the following purpose: given a top or bottom compression of some element x
relative to a specific and known t-gadget, x can be determined in $O(1)$ time by simply using the backpointer
of the compression as an index into the log of the t-gadget. In this way the bits removed from the distribution
and shadow hashes of x to form its compression can be restored; as we will see, this allows the support
of "uncompression" as recursive queries return.
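
Compression and uncompression can be sketched on plain integers. The helper names below are assumptions for illustration, `bits` plays the role of lg t (taken to be even), and the log is modeled as a list of blocks, each a list of elements `((p, d, s), backpointer)`.

```python
def high(value, bits):
    """Most significant half of a `bits`-bit value."""
    return value >> (bits // 2)


def low(value, bits):
    """Least significant half of a `bits`-bit value."""
    return value & ((1 << (bits // 2)) - 1)


def top_compress(element, block, bits):
    """Keep the page hash and the upper halves; point back to the log block."""
    (p, d, s), _backpointer = element
    return ((p, high(d, bits), high(s, bits)), block)


def bottom_compress(element, block, bits):
    """Keep the page hash and the lower halves; point back to the log block."""
    (p, d, s), _backpointer = element
    return ((p, low(d, bits), low(s, bits)), block)


def uncompress(compressed, log_blocks, bits, top):
    """Recover the full elements of the indicated block that match."""
    (p, d_half, s_half), block = compressed
    part = high if top else low
    return [e for e in log_blocks[block]
            if e[0][0] == p
            and part(e[0][1], bits) == d_half
            and part(e[0][2], bits) == s_half]


x = ((5, 0b10110011, 0b01011100), 42)  # hypothetical element with lg t = 8
log_blocks = [[((1, 0, 0), 7)], [x]]   # x sits in block 1 of the log
```

Here `top_compress(x, 1, 8)` keeps the upper nibbles of both hashes, and following the backpointer 1 into the log restores the dropped bits.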

2.2.1 Implementation of gadget operations 

Query. A QUERY(x) operation proceeds as follows, where $x = (p, d, s)$ is the current recursive compression
of the original hashed key:

• Inspect the tail block of the log, and retrieve the associated backpointers of all occurrences of x from
there. This takes one block read, and returns all the backpointers of the data satisfying case 1 of
Invariant 4.

• Recursively call QUERY in the top gadget $g^{\top}$ with the top-compressed key $(p, \mathrm{HIGH}(d), \mathrm{HIGH}(s))$.
The top gadget will return a set of backpointers $\{p_1, p_2, \ldots\}$, which are indexes into the log of the
current gadget. The set of items from the log includes all data satisfying case 2 of Invariant 4 plus
possibly some false positives, where the compressed key matches the query key from the perspective
of the top gadget, but the full key is different. For any result returned by the top gadget, the query
inspects the key in the log (taking constant time for the pointer access) and verifies the lower halves
of the distribution and shadow hash codes, LOW(d) and LOW(s). If any of these differ, the result is a
false positive and is discarded. Otherwise, the result is returned along with the original backpointer,
retrieved from the log. We will later need to bound the number of false positives induced by hashing
and compression.

• Recursively call QUERY in the bottom gadget $g^{\bot}_{\mathrm{HIGH}(d)}$ with the bottom-compressed key $(p, \mathrm{LOW}(d), \mathrm{LOW}(s))$.
This returns all data satisfying case 3 of Invariant 4, together with some false positives that may be
introduced by trimming the shadow hash code. For every recursive result, the query accesses the
appropriate page in the log through the returned backpointer, and verifies that the upper half of the
shadow hash code, HIGH(s), which was trimmed by the compression, matches the key. If not, the result
is discarded as a false positive. In case of a match, the original backpointer from the log is returned
to the parent.

Observe that recursing in the bottom gadget cannot introduce a false positive due to the distribution hash, 
since it is only keys with the same top bits as x that appear in the gadget. 

Bulk-Insert. An insertion proceeds as follows: 
1. Add the inserted items at the end of the log.

2. If one or more blocks are filled as a result of step 1, the top-compressed representation
of the data in the newly filled blocks is computed and the resultant top-compressed data is
bulk-inserted into the top gadget. We call this a little flush.

3. If the top gadget contains $b\sqrt{t}$ keys as a result of step 2, it is declared to be full. It is then "destroyed"
(initialized to an empty state) and all keys previously stored in the top gadget are bottom-compressed
and inserted into the appropriate bottom gadget. We call this a big flush. To efficiently implement this
operation, the data to be flushed is copied from the log, where it appears contiguously in uncompressed
form, into cache, where it all fits according to the capacity invariant. There the data is bucketed (for
free using any sorting algorithm, since we are in internal memory) into $\sqrt{t}$ groups depending on the
HIGH(d) field, which indicates which recursive gadget it should be inserted into. Once the bucketing
is complete, the data is converted into the appropriate compressed form, and $\sqrt{t}$ recursive BULK-INSERT
operations are executed in the bottom gadgets.

Note that this procedure enforces Invariant 4 by ensuring each item is either in the tail block of the log,
in the top gadget, or in a bottom gadget. The capacity requirement of the top gadget is explicitly enforced.
The capacity requirement of any bottom gadget holds w.h.p.; this is argued in Section 2.2.4.
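
The two operations can be put together in a small in-memory sketch. It is a simplification under stated assumptions: `bits` stands for lg t and should be a power of two, a block holds a fixed `block_size` of elements instead of b / lg t, the base case is a plain list, and no paging or I/O accounting is modeled. The class and attribute names are hypothetical.

```python
class Gadget:
    """In-memory sketch of a recursive t-gadget with lg t = bits."""

    def __init__(self, bits, b=8, block_size=4, min_bits=2):
        self.bits, self.b = bits, b
        self.block_size, self.min_bits = block_size, min_bits
        self.base = bits <= min_bits
        self.items = []      # base case only: list of (key, backpointer)
        self.blocks = [[]]   # the log; blocks[-1] is the never-full tail block
        self.half = bits // 2
        self.mask = (1 << self.half) - 1
        self.top = None      # top gadget, created on the first little flush
        self.top_start = 0   # first log block whose keys are in the top gadget
        self.top_count = 0   # number of keys currently in the top gadget
        self.bottoms = {}    # HIGH(d) -> bottom gadget, created lazily

    def _child(self):
        return Gadget(self.half, self.b, self.block_size, self.min_bits)

    def bulk_insert(self, elements):
        if self.base:
            self.items.extend(elements)
            return
        first_new = len(self.blocks) - 1
        for element in elements:                  # step 1: append to the log
            self.blocks[-1].append(element)
            if len(self.blocks[-1]) == self.block_size:
                self.blocks.append([])
        tail = len(self.blocks) - 1
        batch = [((p, d >> self.half, s >> self.half), i)
                 for i in range(first_new, tail)
                 for (p, d, s), _ in self.blocks[i]]
        if batch:                                 # step 2: little flush
            if self.top is None:
                self.top = self._child()
            self.top.bulk_insert(batch)
            self.top_count += len(batch)
        if self.top_count >= (self.b << self.half):  # step 3: big flush
            buckets = {}
            for i in range(self.top_start, tail):
                for (p, d, s), _ in self.blocks[i]:
                    buckets.setdefault(d >> self.half, []).append(
                        ((p, d & self.mask, s & self.mask), i))
            for high_d, group in buckets.items():
                if high_d not in self.bottoms:
                    self.bottoms[high_d] = self._child()
                self.bottoms[high_d].bulk_insert(group)
            self.top, self.top_count, self.top_start = None, 0, tail

    def query(self, key):
        if self.base:
            return [bp for k, bp in self.items if k == key]
        p, d, s = key
        hits = []
        if self.top is not None:
            hits += self.top.query((p, d >> self.half, s >> self.half))
        if (d >> self.half) in self.bottoms:
            hits += self.bottoms[d >> self.half].query(
                (p, d & self.mask, s & self.mask))
        result = [bp for k, bp in self.blocks[-1] if k == key]  # tail block
        for i in sorted(set(hits)):  # verify each candidate block once
            result += [bp for k, bp in self.blocks[i] if k == key]
        return result


demo = Gadget(bits=8)
demo.bulk_insert([((1, 0xA3, 0x5C), 7), ((1, 0xA3, 0x5C), 8), ((0, 0x10, 0x01), 9)])
```

One design choice in this sketch differs from a literal reading of the text: candidate log blocks are deduplicated before verification, so that a block containing several copies of the same key is scanned only once.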

2.2.2 Base case 

We switch to the base case of the recursion when $t \le t^{\min}$, for a parameter $t^{\min}$ to be determined (see §2.3).
To achieve the full range of our trade-offs, we need to use a different non-recursive construction for these
small gadgets. Such a gadget maintains a single buffer page with the last $\le \frac{b}{\lg t}$ inserted keys. The rest of the
keys are simply stored in a hash table (e.g. collision chaining) addressed by the page hash. This is the one
place where the page hash is used. With a succinct representation, the table occupies $O(|S| / \frac{b}{\lg t}) = O(t \lg t)$
pages.

QUERY(x) inspects the buffer page and only one page of the hash table w.h.p. (since we have assumed
$B = \Omega(\lg n)$, and the maximal chain is $O(\lg n / \lg\lg n)$ w.h.p.). Thus, a query takes time $O(1)$. Update
simply appends keys to the buffer page. When the buffer page fills, all keys are inserted into the hash table.
This operation may need to touch all $O(t \lg t)$ pages, since the new keys are likely to have hash codes that
are spread out; however the cost cannot exceed $O(t \lg t)$ since the capacity invariants ensure the whole
gadget fits in memory.
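
A minimal sketch of the base gadget follows. The class name, the use of `p % num_pages` to pick a table page, and the `page_writes` counter are assumptions for illustration; the succinct encoding is not modeled.

```python
class BaseGadget:
    """Non-recursive gadget: one buffer page plus a chained hash table."""

    def __init__(self, num_pages, buffer_capacity):
        self.pages = [[] for _ in range(num_pages)]  # hash table, one chain per page
        self.buffer = []                             # the single buffer page
        self.buffer_capacity = buffer_capacity
        self.page_writes = 0                         # table pages touched by flushes

    def bulk_insert(self, elements):
        for element in elements:
            self.buffer.append(element)
            if len(self.buffer) == self.buffer_capacity:
                self._flush()

    def _flush(self):
        touched = set()
        for (p, d, s), backpointer in self.buffer:
            page = p % len(self.pages)  # the table is addressed by the page hash
            self.pages[page].append(((p, d, s), backpointer))
            touched.add(page)
        self.page_writes += len(touched)
        self.buffer = []

    def query(self, key):
        # One look at the buffer page and one table page.
        page = self.pages[key[0] % len(self.pages)]
        return [bp for k, bp in self.buffer + page if k == key]


small = BaseGadget(num_pages=4, buffer_capacity=3)
small.bulk_insert([((0, 1, 1), 10), ((1, 2, 2), 11), ((0, 1, 1), 12), ((5, 0, 0), 13)])
```

After the four insertions, the first three keys have been flushed into two table pages and the fourth key is still in the buffer page.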

2.2.3 Analysis of the t-gadget

In this section, we analyze the performance of gadgets, delaying the probabilistic analysis to the next section. 

Space usage. Though a key stored in a gadget may appear in $\Theta(\lg\lg n)$ recursive gadgets, the repeated
compression ensures that the space used by all compressed occurrences of the key is dominated by the
top-level representation of the key. Formally, a single key in a t-gadget occupies $O(\lg t)$ bits of space and
may appear recursively in at most one $\sqrt{t}$-gadget. Thus, the space per key is given by the recurrence
$S(t) \le S(\sqrt{t}) + O(\lg t)$, which solves to $O(\lg t)$ bits. Overall, a t-gadget storing n keys uses $O(n)$ words
of space.

Update cost. Over its lifetime in a t-gadget, an element will be appended to the log once, participate in a 
little flush (being inserted into the top gadget) at most once, and participate in a big flush (being moved
into a bottom gadget) at most once. 

We begin our analysis with the cost of BULK-INSERT, excluding recursive calls. A key participates
in a BULK-INSERT operation if it is one of the inserted items or participates in a big or little flush. If
k is the number of participating items, the actual running time, excluding recursive calls, is $O(1 + k \frac{\lg t}{b})$.
Since $\frac{\lg t}{b}$ is the (fractional) number of blocks occupied by a single element in a t-gadget, this is the fastest
possible (linear time). Let us briefly explain how this is achieved for each of the three steps presented in
the description of the BULK-INSERT operation. The first step, inserting at the end of the log, can be done
efficiently since we required that the inserted data is already presented packed into pages. In the little flush,
one or more blocks of data from the log need to be copied and converted into top-compressed form and
delivered to a single recursive BULK-INSERT; this can be done with a simple scan. The big flush, by definition,
is performed when the top $\sqrt{t}$-gadget is full, containing $b\sqrt{t}$ keys; the actual cost (including the calling of
but excluding the execution of the recursive bulk insertions) is $O(\sqrt{t} \lg \sqrt{t} + \sqrt{t})$, with the $\sqrt{t}$ being a
lower-order term due to the $\sqrt{t}$ recursive calls to BULK-INSERT.

We now bound the recursive cost by amortizing. To cover the constant additive term, we assign an 
amortized $O(1)$ credit to every BULK-INSERT operation. The credit for the recursive call in the top gadget
can be paid because a recursive call is only made when we fill a page of the log. The credit for the recursive 
calls in the bottom gadgets is a lower order term compared to sorting the entire top gadget. 

Now we are left with a cost of $O(k \frac{\lg t}{b})$ for a BULK-INSERT operation in which k keys participate. This
translates into an amortized cost of $O(\frac{\lg t}{b})$ per key. Let $U(t)$ be the total cost charged to a key by a t-gadget,
including recursive calls. This is described by the recurrence:

$U(t) = O(\frac{\lg t}{b}) + 2 \cdot U(\sqrt{t})$;    $U(t \le t^{\min}) = O(\frac{t^{\min} \lg t^{\min}}{b})$

Observe that on level i of the recurrence, we have $2^i$ terms of $O(\frac{1}{b} \lg(t^{2^{-i}}))$, i.e. a constant total cost
of $O(\frac{\lg t}{b})$ per level. This is very intuitive: at each level, our key is broken into many components, but their
total size stays $\lg t$ bits. Since the cost is proportional to the bit complexity of the key, the cost of each level
is constant; e.g. at the top level you recurse twice on keys of half the size. This property of the data structure
is the most crucial element in obtaining our upper bound. It is only possible due to our compression of the
keys; without compression, i.e. without violating indivisibility, the cost would increase geometrically at each
level instead of remaining unchanged.

The recursion for $U(t)$ solves as follows: the recursion has $O(\lg\lg t)$ levels at a cost of $O(\frac{\lg t}{b})$ per level.
In the base case, the recursion has $\lg t / \lg t^{\min}$ leaves, each of cost $O(\frac{1}{b} t^{\min} \lg t^{\min})$. Thus the total cost is

$U(t) = \frac{\lg t}{b} \cdot O(\lg\lg t + t^{\min})$.
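
As a sanity check of this solution, the recurrence can be evaluated numerically with every hidden constant set to 1. The function names and sample parameters below are assumptions for illustration; the arguments are lg t and lg t^min, chosen so that repeated halving of lg t reaches lg t^min exactly.

```python
def update_cost(lg_t, lg_tmin, b):
    """U(t) as a function of lg t, with every O(.) constant set to 1."""
    if lg_t <= lg_tmin:
        return (2 ** lg_tmin) * lg_tmin / b  # base case: t_min * lg t_min / b
    return lg_t / b + 2 * update_cost(lg_t // 2, lg_tmin, b)


def closed_form(lg_t, lg_tmin, b):
    """(lg t / b) * (number of recursive levels + t_min)."""
    levels = (lg_t // lg_tmin).bit_length() - 1  # lg(lg t / lg t_min)
    return lg_t / b * (levels + 2 ** lg_tmin)


# Hypothetical parameters: lg t = 64, lg t_min = 4, b = 1024 bits per page.
recursive_value = update_cost(64, 4, 1024)
formula_value = closed_form(64, 4, 1024)
```

Each of the four recursive levels contributes lg t / b = 1/16, and the 16 leaves contribute 16 times that amount in total, matching the two terms of the closed form.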

Query cost. The query time is proportional to the number of (true positive) results returned. Let Q{t) be 
the cost per result of a query in a t-gadget. We first note that Q{t) is at least 1, because any result has to 
be looked up in the log, in order to check whether it is a true positive. The query cost is proportional to the
number of gadgets traversed, and is described by the easy recursion:

$Q(t) = 1 + 2 \cdot Q(\sqrt{t})$;    $Q(t^{\min}) = 1$

The number of gadgets grows exponentially with the level, and is dominated by the base case. The total cost
is therefore $Q(t) = O(\lg t / \lg t^{\min})$.
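
The same kind of check applies to the query recursion. In this sketch (function name and parameters are illustrative assumptions) the arguments are lg t and lg t^min, and the recursion is evaluated exactly.

```python
def query_cost(lg_t, lg_tmin):
    """Q(t) = 1 + 2 Q(sqrt(t)) with Q(t_min) = 1, as a function of lg t."""
    if lg_t <= lg_tmin:
        return 1
    return 1 + 2 * query_cost(lg_t // 2, lg_tmin)


# Hypothetical parameters: lg t = 64 and lg t_min = 4, so lg t / lg t_min = 16.
gadgets_visited = query_cost(64, 4)
```

When lg t / lg t^min is a power of two, the recursion evaluates to exactly 2 lg t / lg t^min - 1, which is dominated by the lg t / lg t^min leaves.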

For each false positive encountered throughout the query, there is an additional cost of at most $O(\lg \frac{\lg t}{\lg t^{\min}})$.
Indeed, each key is in at most $O(\lg \frac{\lg t}{\lg t^{\min}})$ levels at a time, and we need to do $O(1)$ work per level: we go
to the appropriate page in the log to verify the identity of the key and retrieve its data. We will show later
that the total overhead due to false positives is $O(Q(t))$ w.h.p.

2.2.4 Probabilistic Analysis

Capacity bounds. We first prove that no t-gadget g receives more than $O(bt)$ keys w.h.p. Note that
shadow and page hashes are irrelevant to this question, and we only need to analyze distribution hashes. For
now, assume that our hashing is truly random.

Let g' be the lowest ancestor of g in the recursion tree which is a top-recursive gadget of its parent; say
g' is a t'-gadget. We conventionally interpret the root gadget to be a top-recursive gadget, so g' is always
defined. Note that g' could be g. Remember that BULK-INSERT enforces a worst-case capacity bound on
any top gadget by the big flush operation, so the number of keys of g' is at most $b \cdot t'$ in the worst case. A
key from a t'-gadget ends up in a specific bottom gadget only if the first half of its distribution hash matches
the identity of the bottom gadget. Recursively, keys that end up in a specific grandchild $\sqrt[4]{t'}$-gadget have the
same prefix of $\frac{3}{4} \lg(t')$ bits of the distribution hash. Since g' is the lowest ancestor of g that is a top-recursive
gadget, all keys of g' that end up in g do so through bottom recursion, i.e. they will all have a common
prefix of $\lg(t') - \lg t$ bits. Therefore, analyzing the number of keys of g' that end up in g is a standard
balls-in-bins problem with the $bt'$ balls of g' being distributed uniformly into $t'/t$ bins. The expected number
of balls landing in gadget g is $tb$. Since $b = \omega(\lg n)$, the Chernoff bound says that we have at most $O(tb)$
keys in the bin with high probability in n.
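
The balls-in-bins step can be illustrated with a small simulation. Python's seeded pseudo-random generator stands in for a truly random distribution hash, and the function name and parameters are assumptions chosen for illustration.

```python
import random


def max_load(balls, bins, seed=0):
    """Throw `balls` balls uniformly into `bins` bins; return the fullest bin."""
    rng = random.Random(seed)
    load = [0] * bins
    for _ in range(balls):
        load[rng.randrange(bins)] += 1
    return max(load)


# Hypothetical sizes: t'/t = 64 bins and an expected load of tb = 2000 per bin.
bins, expected_load = 64, 2000
fullest = max_load(bins * expected_load, bins)
```

With an expected load this far above lg n, the fullest bin stays within a small constant factor of the expectation, which is the content of the Chernoff argument.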

False positives during Query. We now switch to analyzing the number of false positives encountered
by QUERY(x). We count a false positive only once, in the first level of recursion where it is introduced. As
noted above, a false positive introduced in a t-gadget induces an additive cost of $O(\lg \frac{\lg t}{\lg t^{\min}})$ on the query
time.

We claim that for any t, the number of false positives introduced in all t-gadgets that the query
traverses is $O(\log_t n)$ w.h.p. Thus, the total cost on one level of the recursion is $O(\log_t n \cdot \lg \frac{\lg t}{\lg t^{\min}})$. At the
i-th level of the recursion, we have $\lg t = 2^i \lg t^{\min}$, so the total cost is:

$\sum_i O(\frac{\lg n}{2^i \lg t^{\min}} \cdot \lg \frac{2^i \lg t^{\min}}{\lg t^{\min}}) = O(\frac{\lg n}{\lg t^{\min}} \sum_i \frac{i}{2^i}) = O(\frac{\lg n}{\lg t^{\min}}) = O(Q(t))$.

Thus, the total cost due to false positives is $O(Q(t))$ w.h.p.

We must now prove our claim that the number of false positives introduced in all $t$-gadgets on the query path is
$O(\log_t n)$ w.h.p. There are $\log_t n$ $t$-gadgets on the path of the query, and each cares about a disjoint interval
of $\lg t$ bits of the distribution and shadow hash codes. For the analysis, we imagine fixing a growing prefix
of the distribution and shadow hashes, in increments of $\lg t$ bits. At every step, the fixing so far decides
which keys land in the next $t$-gadget. Among these, only the most recent $\sqrt{t} \cdot b$ can be in the top gadget
at query time. One of these keys is a false positive iff it is different from the query, yet its page hash and
the top half of its distribution and shadow hash codes match the query key. These hash codes consist of
$\lg b + 2\lg\sqrt{t}$ random bits, so this event happens with probability $\frac{1}{bt}$ independently for each of the $\sqrt{t} \cdot b$
keys. For bottom compression, a key can only be a false positive iff it is different from the query, yet its page
hash, full distribution hash, and the bottom half of the shadow hash match the query. These hash codes have
$\lg b + 3\lg\sqrt{t}$ random bits together, so a key is a false positive in a bottom recursion with probability $\frac{1}{bt\sqrt{t}}$.
Over the path of the query through $\log_t n$ different $t$-gadgets, there are $b\sqrt{t} \cdot \log_t n$ keys that could become
false positives through top recursion (each independently with probability $\frac{1}{bt}$), and w.h.p. $O(bt \cdot \log_t n)$ keys
that could become false positives through bottom recursion (each independently with probability $\frac{1}{bt\sqrt{t}}$). The
expected number of false positives among all $t$-gadgets is therefore $O\big(\frac{\log_t n}{\sqrt{t}}\big)$. The Chernoff bound (in the
Poisson-type regime) shows that the number of false positives exceeds $O(\log_t n)$ with probability at most
$(1/\sqrt{t})^{\Omega(\log_t n)} = 2^{-\Omega(\lg n)}$, i.e. the bound holds with high probability in $n$.
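
The expectation above is simple arithmetic, sketched below. This is a hedged illustration only: it assumes the match probabilities $\frac{1}{bt}$ (top recursion) and $\frac{1}{bt\sqrt{t}}$ (bottom recursion) that follow from the stated bit counts, and the function name and sample parameters are hypothetical.

```python
import math


def expected_false_positives(n, t, b):
    """Expected false positives over all t-gadgets on one query path.

    Follows the counting in the text: log_t(n) gadgets on the path,
    b*sqrt(t) candidate keys per gadget via top recursion (each matching
    with probability 1/(b*t)), and b*t candidates via bottom recursion
    (each matching with probability 1/(b*t*sqrt(t))).
    """
    gadgets_on_path = math.log2(n) / math.log2(t)
    sqrt_t = math.sqrt(t)
    top = (b * sqrt_t * gadgets_on_path) * (1.0 / (b * t))
    bottom = (b * t * gadgets_on_path) * (1.0 / (b * t * sqrt_t))
    return top, bottom


# Sample values (assumptions): n = 2^40 keys, t = 2^8, b = 64.
top, bottom = expected_false_positives(n=2**40, t=2**8, b=64)
print(top, bottom)
```

Both contributions come out to $\log_t n / \sqrt{t}$, well below the $O(\log_t n)$ threshold that the Chernoff bound is applied to.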

Since our analysis only relies on Chernoff bounds, it holds even with weaker hash functions that satisfy
such bounds. The main requirement for the hash function is that it satisfy Chernoff-type concentration
even among the set of keys that share a certain prefix of the hash code. This is true of $k$-independent hash
functions, since a $k$-independent distribution remains $k$-independent if we fix a prefix of the bits of the
outcome. Thus, a $\Theta(\lg n)$-independent hash function suffices for our data structure, considering Chernoff
bounds with limited independence given by [SSS95]. These hash functions can be represented in $O(\lg n)$
words, which fits in cache, and therefore can be evaluated without I/O cost.
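
As an illustration of how a $k$-independent function can be stored in $k$ words, here is a minimal sketch of the classical polynomial construction over a prime field. This is a standard technique named here for illustration; the text does not say which $k$-independent family the authors use. The modulus `PRIME`, the function name, and the truncation to `out_bits` are assumptions of the sketch.

```python
import random

PRIME = (1 << 61) - 1  # Mersenne prime 2^61 - 1, used as the field size (an assumption)


def make_k_independent_hash(k, out_bits, seed=None):
    """Return h(x) = (random degree-(k-1) polynomial in x, mod PRIME), truncated to out_bits."""
    rng = random.Random(seed)
    coeffs = [rng.randrange(PRIME) for _ in range(k)]   # k words describe the function

    def h(x):
        acc = 0
        for c in coeffs:                 # Horner's rule evaluation
            acc = (acc * x + c) % PRIME
        # Keeping the low bits is only approximately uniform, since PRIME is not a power of two.
        return acc & ((1 << out_bits) - 1)

    return h


h = make_k_independent_hash(k=40, out_bits=16, seed=1)
print(h(12345))
```

The coefficient list plays the role of the $O(\lg n)$ words kept in cache, so evaluating `h` needs no access to external memory.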

2.3 Putting Everything Together 

We now describe how gadgets are used to make a dictionary structure that supports INSERT and SEARCH 
operations. Recall that the capacity invariant of our gadget requires that the total size of a gadget is
smaller than the cache size, so a single gadget in and of itself cannot be the target data structure. 

Our data structure is globally organized like the original buffer tree, except that each node is implemented
by a gadget to support fast queries. At the global level, only comparisons are needed.

Formally, we have a tree with branching factor $M^{\varepsilon}$. Each node can store up to $M$ keys. These keys are
stored in a plain array, but also in an $\frac{M}{b}$-gadget. The gadgets are independent random constructions (with
different i.i.d. hash functions). The $O(\log n)$-bit key values are partitioned into distribution, shadow, and
block hash fields when inserting and searching the gadgets in the nodes. When the capacity of the node fills,
its keys are distributed to its children. Since $M \ge B^{1+\varepsilon}$ (the tall cache assumption), this distribution is I/O
efficient, costing $O(\frac{1}{B})$ per key.

The total cost for inserting a key is: 

log,, n) + Uif)- log,, n = O(^) + 0(log,, n) • ^ • (Ig Ig M + Ig t^'") 

Thus $t_u = O(\frac{1}{B}) \cdot \left( \log_M n + \lg\lg M + t_{\min} \lg t_{\min} \right)$. Note that $\log_M n + \lg\lg M = \Theta(\log_M n + \lg\lg n)$.
For any $\lambda \ge \log_M n + \lg\lg n$, we can achieve $t_u = O(\lambda/B)$ by setting $t_{\min} \lg t_{\min} = \lambda$.

The cost of querying a key is $Q(\frac{M}{b}) \cdot O(\log_M n) = O\big(\frac{\lg M}{\lg t_{\min}} \cdot \log_M n\big) = O\big(\frac{\lg n}{\lg t_{\min}}\big) = O(\log_\lambda n)$.

The total space of our data structure is linear. Each key is stored in only one top-level gadget at a time,
and, as we argued in Section 2.2.3, these have linear size (in words).

A Lower Bounds

Formally, our model of computation is defined as follows. The update and query algorithms execute on a 
processor with M bits of state, which is preserved from one operation to the next. The memory is an array 
of cells of $O(B \lg n)$ bits each, and allows random access in constant time. The processor is nonuniform:
at each step, it decides on a memory cell as an arbitrary function of its state. It can either write the cell
(with a value chosen as an arbitrary function of the state), or read the cell and update the state as an arbitrary 
function of the cell contents. 

We first define our hard distribution. Choose $S$ to be a random set of $n$ keys from the universe $[2n]$,
and insert the keys of $S$ in random order. The sequence contains a single query, which appears at a random
position in the second half of the stream of operations. In other words, we pick $t \in [\frac{n}{2}, n]$ and place the
query after the $t$-th update (say, at time $t + \frac{1}{2}$).

Looking back in time from the query, we group the updates into epochs of $\lambda^i$ updates, where $\lambda$ is a
parameter to be determined. Let $S_i \subseteq S$ be the keys inserted in epoch $i$; $|S_i| = \lambda^i$. Let $i_{\max}$ be the largest
value such that $\sum_{i=0}^{i_{\max}} \lambda^i \le \frac{n}{2}$ (remember that there are at least $n/2$ updates before the query). We always
construct $i_{\max}$ epochs. Let $i_{\min}$ be the smallest value such that $\lambda^{i_{\min}-1} \ge M$; remember that $M < n^{1-\varepsilon}$
was the cache size in bits. Our lower bound will generally ignore epochs below $i_{\min}$, since most of those
elements may be present in the cache. Note $i_{\max} - i_{\min} \ge \varepsilon \log_\lambda n - O(1)$.
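
To make the hard distribution and the epoch decomposition concrete, here is a minimal sampling sketch. The function name `sample_hard_instance` and the small values of `n` and `lam` are illustrative assumptions; the sketch only mirrors the definitions above (random set from $[2n]$, random insertion order, query after the $t$-th update, epochs of $\lambda^i$ updates counted backwards from the query).

```python
import random


def sample_hard_instance(n, lam, seed=0):
    """Sample the update sequence, the query position, and the epochs seen from the query."""
    rng = random.Random(seed)
    updates = rng.sample(range(2 * n), n)   # n distinct keys from [2n], in random insertion order
    t = rng.randint(n // 2, n)              # the query is placed after the t-th update

    # Walk backwards from the query: epoch i consists of the lam**i updates
    # that precede epoch i-1. Stop once the epochs would exceed n/2 updates in total.
    epochs, end, total, i = [], t, 0, 0
    while total + lam**i <= n // 2:
        size = lam**i
        epochs.append(updates[end - size:end])
        end -= size
        total += size
        i += 1
    return updates, t, epochs


updates, t, epochs = sample_hard_instance(n=1000, lam=4)
print(t, [len(e) for e in epochs])
```

Epoch 0 is the single update just before the query, and each later epoch is $\lambda$ times larger, which is why a uniformly chosen epoch gives keys in small epochs a much higher chance of being queried.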

We have not yet specified the distribution of the query. Let $\mathcal{D}_{\mathrm{No}}$ be the distribution in which the query is
chosen uniformly at random from $[2n] \setminus S$. Let $\mathcal{D}_i$ be the distribution in which the query is uniformly chosen
among $S_i$. Then, $\mathcal{D}_{\mathrm{Yes}} = \frac{1}{i_{\max} - i_{\min} + 1} \left( \mathcal{D}_{i_{\min}} + \cdots + \mathcal{D}_{i_{\max}} \right)$. Observe that elements in smaller epochs have
a much higher probability of being the query. We will prove a lower bound on the mixture $\frac{1}{2}(\mathcal{D}_{\mathrm{Yes}} + \mathcal{D}_{\mathrm{No}})$.
Since we are working under a distribution, we may assume the data structure is deterministic by fixing its 
random coins (nonuniformly). 

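Observation 5. Assume the data structure has error $C_\varepsilon$ on $\frac{1}{2}(\mathcal{D}_{\mathrm{Yes}} + \mathcal{D}_{\mathrm{No}})$. Then the error on $\mathcal{D}_{\mathrm{No}}$ is at
most $2C_\varepsilon$. At least half the choices of $i \in \{i_{\min}, \ldots, i_{\max}\}$ are good in the sense that the error on $\mathcal{D}_i$ is at
most $4C_\varepsilon$.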

We will now present the high-level structure of our proof. We focus on some epoch $i$ and try to prove
that a random query in $\mathcal{D}_{\mathrm{No}}$ reads $\Omega(1)$ cells that are somehow "related" to epoch $i$. To achieve this, we
consider the following encoding problem: for $k = \lambda^i$, the problem is to send a random bit vector $A[1 \ldots 2k]$
with exactly $k$ ones. The entropy of $A$ is $\log_2 \binom{2k}{k} = 2k - O(\lg k)$ bits.
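
The entropy estimate $\log_2 \binom{2k}{k} = 2k - O(\lg k)$ can be sanity-checked numerically. The sketch below is only a check for a few sample values of $k$ (chosen arbitrarily), not a proof; `entropy_gap` is a hypothetical helper name.

```python
import math


def entropy_gap(k):
    """Return 2k - log2(C(2k, k)), the slack between 2k bits and the entropy of A."""
    return 2 * k - math.log2(math.comb(2 * k, k))


for k in (16, 256, 4096):
    print(k, round(entropy_gap(k), 3), math.log2(k))
```

For these values the gap is positive and stays below $\lg k$, in line with the $O(\lg k)$ term.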

The encoder constructs a membership instance based on its array $A$ and public coins (which the decoder
can also see). It then runs the data structure on this instance, and sends a message whose size depends on the 
efficiency of the data structure. Comparing the message size to the entropy lower bound, we get an average 
case cell-probe lower bound. 

The encoder constructs a data structure instance as follows: 

• Pick the position of the query by public coins. Also pick $S^* = S \setminus S_i$ by public coins.

• Pick $Q = \{q_1, \ldots, q_{2k}\}$, a random set of $2k$ keys from $[2n] \setminus S^*$, also by public coins.

• Let $S_i = \{q_j \mid A[j] = 1\}$, i.e. the message is encoded in the choice of $S_i$ out of $Q$.

• Run every query in $Q$ on the memory snapshot just before the position chosen in step 1.

The idea is to send a message, depending on the actions of the data structure, that will also allow the 
decoder to simulate the queries. Note that, up to the small probability of a query error, this allows the 
decoder to recover $A$: $A[j] = 1$ iff $q_j \in S$.
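
The core of the reduction, encoding $A$ in the choice of $S_i$ and decoding it through membership queries, can be sketched as follows. This is a simplified illustration under stated assumptions: the other epochs ($S^*$) are omitted, a Python set stands in for an error-free dictionary, and the helper names `build_instance` and `recover_vector` are hypothetical.

```python
import random


def build_instance(A, Q):
    """Encoder side: S_i contains q_j exactly when A[j] = 1."""
    return {q for q, bit in zip(Q, A) if bit == 1}


def recover_vector(Q, member):
    """Decoder side: A[j] = 1 iff the membership query on q_j answers yes."""
    return [1 if member(q) else 0 for q in Q]


rng = random.Random(0)
n, k = 1000, 8
Q = rng.sample(range(2 * n), 2 * k)      # public coins: 2k candidate keys
A = [1] * k + [0] * k
rng.shuffle(A)                           # random bit vector with exactly k ones
S_i = build_instance(A, Q)
# A Python set stands in for an error-free dictionary (an assumption of this sketch).
decoded = recover_vector(Q, lambda q: q in S_i)
print(decoded == A)
```

In the actual argument the decoder has no direct access to $S_i$; the message must carry enough of the memory contents for the decoder to simulate these queries, which is what ties the message size to the query cost.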

It is crucial to note that the above process generates an $S$ that is uniform inside $[2n]$. Furthermore, each
$q_j$ with $A[j] = 0$ is uniformly distributed over $[2n] \setminus S$. That means that for each negative query, the data
structure is being simulated on $\mathcal{D}_{\mathrm{No}}$. For each positive query, the data structure is being simulated on $\mathcal{D}_i$.

Let $R_i$ be the cells read during epoch $i$, and $W_i$ be the cells written during epoch $i$ (these are random
variables). We will use the convenient notation $W_{<i} = \bigcup_{j=0}^{i-1} W_j$.

The encoding algorithm will require different ideas for $t_u = 1 - \varepsilon$ and $t_u = o(1)$.

A.1 Constant Update Time

We now give a much simpler proof for the lower bound of [VZ10]: for any $t_u \le 1 - \varepsilon$, the query time is
$t_q = \Omega(\log_B n)$.

We first note that the decoder can compute the snapshot of the memory before the beginning of epoch $i$,
by simply simulating the data structure ($S_{>i}$ was chosen by public coins). The message of the encoder will
contain the cache right before the query, and the address and contents of $W_{<i}$. To bound $|W_{<i}|$ we use:

Observation 6. For any epoch $i$, $E[|W_i|] \le \lambda^i t_u$.

Proof. We have a sequence of $n$ updates, whose average expected cost (by amortization) is $t_u$ cell probes.
Epoch $i$ is an interval of $\lambda^i$ updates. Since the query is placed randomly, this interval is shifted randomly,
so its expected cost is $\lambda^i t_u$. □

We start by defining a crucial notion: the footprint $\mathrm{Foot}_i(q)$ of query $q$, relative to epoch $i$. Informally,
these are the cells that query $q$ would read if the decoder simulated it "to the best of its knowledge." Formally,
we simulate $q$, but whenever it reads a cell from $W_i \setminus W_{<i}$, we feed into the simulation the value of the cell
from before epoch $i$. Of course, the simulation may be incorrect if at least one cell from $W_i \setminus W_{<i}$ is read.
However, we insist on a worst-case time of $t_q$ (remember that we allow Monte Carlo randomization, so we
can put a worst-case bound on the query time). Let $\mathrm{Foot}_i(q)$ be the set of cells that the query algorithm
reads in this, possibly bogus, simulation.

Note that the decoder can compute $\mathrm{Foot}_i(q)$ if it knows the cache contents at query time and $W_{<i}$: it
need only simulate $q$ and assume optimistically that no cell from $W_i \setminus W_{<i}$ is read.

For some set $X$, let $\mathrm{Foot}_i(X) = \bigcup_{q \in X} \mathrm{Foot}_i(q)$. Now define the footprint of epoch $i$: $F_i = (\mathrm{Foot}_i(S_i) \cup
W_i) \setminus F_{<i}$. In other words, the footprint of an epoch contains the cells written in the epoch and the cells
that the positive queries to that epoch would read in the decoder's optimistic simulation, but excludes the
footprints of smaller epochs. Note that $F_{<i} = W_{<i} \cup \big(\bigcup_{j=0}^{i-1} \mathrm{Foot}_j(S_j)\big)$.

We are now ready for the main part of our proof. Let $H_b(\cdot)$ be the binary entropy.

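Lemma 7. Let $i \in \{i_{\min}, \ldots, i_{\max}\}$ be good and $k = \lambda^i$. Let $p$ be the probability over $\mathcal{D}_{\mathrm{No}}$ that the
query reads a cell from $F_i$. We can encode a vector of $2k$ bits with $k$ ones by a message of expected size:
$O(\lambda^{i-1} t_q \cdot B \lg n) + 2k \cdot \left[ H_b(\frac{1-\varepsilon}{2}) + H_b(\frac{p}{2}) + H_b(3C_\varepsilon) \right]$.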

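Before we prove the lemma, we show it implies the desired lower bound. Choose $\lambda = B \lg^2 n$, which
implies $\lambda^{i-1} t_q \cdot B \lg n = o(\lambda^i) = o(k)$. Note that $H_b(\frac{1-\varepsilon}{2})$ is a constant bounded below 1 (depending
on $\varepsilon$). On the other hand $H_b(3C_\varepsilon) = O(C_\varepsilon \lg \frac{1}{C_\varepsilon})$. Thus, there exists a small enough $C_\varepsilon$ depending on $\varepsilon$
such that $H_b(\frac{1-\varepsilon}{2}) + H_b(3C_\varepsilon) < 1$. Since the entropy of the message is $2k - O(\lg k)$, we must have
$H_b(\frac{1-\varepsilon}{2}) + H_b(3C_\varepsilon) + H_b(\frac{p}{2}) \ge 1 - o(1)$. Thus $H_b(\frac{p}{2}) = \Omega(1)$, so $p = \Omega(1)$.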

We have shown that a query over $\mathcal{D}_{\mathrm{No}}$ reads a cell from $F_i$ with constant probability, for any good
$i$. Since half the $i$'s are good and the $F_i$'s are disjoint by construction, the expected query time must be
$t_q = \Omega(\log_\lambda n) = \Omega(\log_B n)$.

Proof of Lemma 7. The encoding will consist of:

1. The cache contents right before the query.

2. The address and contents of cells in $F_{<i}$. In particular this includes $W_{<i}$, so the decoder can compute
$\mathrm{Foot}_i(q)$ for any query.

3. Let $X \subseteq Q \setminus S_i$ be the set of negative queries that read at least one cell from $F_i$. We encode $X$ as a
subset of $Q$, taking $\log_2 \binom{2k}{|X|} + O(\lg k)$ bits.

4. For each cell $c \in W_i \setminus F_{<i}$, mark some positive query $q \in S_i$ such that $c \in \mathrm{Foot}_i(q)$. Some cells
may not mark any query (if they are not in the footprint of any positive query), and multiple cells may
mark the same query. Let $M$ be the set of marked queries. We encode $M$ as a subset of $Q$, taking
$\log_2 \binom{2k}{|M|} \le \log_2 \binom{2k}{|W_i|}$ bits.

5. The set of queries of $Q$ that return incorrect results (Monte Carlo error).

The decoder immediately knows that queries from $X$ are negative, and queries from $M$ are positive. Now
consider what happens if the decoder simulates some query $q \in Q \setminus (X \cup M)$. If $q$ reads some cell from
$\mathrm{Foot}_i(M) \setminus F_{<i}$, we claim it must be a positive query. Indeed, $M \subseteq S_i$, so $\mathrm{Foot}_i(M) \setminus F_{<i} \subseteq F_i$. But any
negative query that reads from $F_i$ was identified in $X$.

Claim 8. If a query $q \in Q \setminus (X \cup M)$ does not read any cell from $\mathrm{Foot}_i(M) \setminus F_{<i}$, the decoder can simulate
it correctly.

Proof. We will prove that such a query does not read any cell from $W_i \setminus F_{<i}$, which means that the decoder
knows all necessary cells. First consider positive $q$. If $q$ reads a cell from $W_i \setminus F_{<i}$ and $q$ is not marked,
it means this cell marked some other $q' \in M$. But that means the cell is in $\mathrm{Foot}_i(q') \subseteq \mathrm{Foot}_i(M)$,
contradiction.

Now consider negative $q$. Note that $W_i \setminus F_{<i} \subseteq F_i$, so if $q$ had read a cell from this set, it would have
been placed in the set $X$. □

Of course, some queries may give wrong answers when simulated correctly (Monte Carlo error). The
decoder can fix these using component 5 of the encoding.

It remains to analyze the expected size of the encoding. To bound the binomial coefficients, we use the
inequality: $\log_2 \binom{n}{k} \le n \cdot H_b(\frac{k}{n})$. We will also use concavity of $H_b(\cdot)$ repeatedly.
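
The binomial bound $\log_2 \binom{n}{k} \le n \cdot H_b(\frac{k}{n})$ is easy to spot-check numerically. The sketch below tests it for a few small values of $n$ (arbitrary choices); the helper names and the floating-point slack `eps` are assumptions of the sketch, and a numerical check is of course not a proof.

```python
import math


def binary_entropy(x):
    """H_b(x) = -x log2 x - (1-x) log2 (1-x), with H_b(0) = H_b(1) = 0."""
    if x in (0.0, 1.0):
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)


def bound_holds(n, k, eps=1e-9):
    """Check log2 C(n, k) <= n * H_b(k/n), up to floating-point slack eps."""
    return math.log2(math.comb(n, k)) <= n * binary_entropy(k / n) + eps


print(all(bound_holds(n, k) for n in (1, 10, 200) for k in range(n + 1)))
```

Because $H_b$ is concave, the same bound applies in expectation: $E[\log_2 \binom{n}{K}] \le n \cdot H_b(E[K]/n)$ for a random $K$, which is how the bound is used on sets of random size.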

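1. The cache size is $M \le \lambda^{i-1}$ bits, since $i \ge i_{\min}$.

2. We have $E[|F_{<i}|] \le E[|W_{<i}|] + |S_{<i}| \cdot t_q \le O(\lambda^{i-1} t_q)$. So this component takes $O(\lambda^{i-1} t_q \cdot B \lg n)$
bits on average.

3. Since $E[|X|] = pk$, this takes $2k \cdot H_b(\frac{p}{2})$ bits on average.

4. Since $E[|W_i|] = t_u k \le (1 - \varepsilon)k$, this takes $2k \cdot H_b(\frac{1-\varepsilon}{2})$ bits on average.

5. The probability of an error is at most $2C_\varepsilon$ on $\mathcal{D}_{\mathrm{No}}$ and $4C_\varepsilon$ on $\mathcal{D}_i$. So we expect at most $6C_\varepsilon \cdot k$ wrong
answers. This component takes $2k \cdot H_b(3C_\varepsilon)$ bits on average.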

This completes the proof of Lemma 7, and our analysis of update time $t_u = 1 - \varepsilon$.

A.2 Subconstant Update Time

The main challenge in the constant regime was that we couldn't identify which cells were the ones in $W_i$;
rather we could only identify the queries that read them. Since $t_u \ll 1$ here, we will be able to identify those
cells among the cells read by queries $Q$, so the footprints are no longer a bottleneck.

The main bottleneck becomes writing $W_{<i}$ to the encoding. Following the trick of [PT07], our message
will instead contain $W_i \cap R_{<i}$ and the cache contents at the end of epoch $i$. We note that this allows the
decoder to recover $W_{<i}$. Indeed, the keys $S_{<i}$ are chosen by public coins, so the decoder can simulate the
data structure after the end of epoch $i$. Whenever the update algorithm wants to read a cell, the decoder
checks the message to see if the cell was written in epoch $i$ (whether it is in $W_i \cap R_{<i}$), and retrieves the
contents if so.

We bound $|W_i \cap R_{<i}|$ by the following, which can be seen as a strengthening of Observation 6:

Lemma 9. At least a quarter of the choices of $i \in \{i_{\min}, \ldots, i_{\max}\}$ are good and satisfy $E[|W_i \cap R_{<i}|] =
O(\lambda^{i-1} t_u / \log_\lambda n)$.

Proof. We will now choose $i$ randomly, and calculate $E[|W_i \cap R_{<i}|/\lambda^i]$, where the expectation is over the
random distribution and random $i$. We will show $E[|W_i \cap R_{<i}|/\lambda^i] = O(t_u/(\lambda \log_\lambda n))$, from which the
lemma follows by a Markov bound and union bound (ruling out the $i$'s that are not good).

A cell is included in $W_i \cap R_{<i}$ if it is read at some time $r$ that falls in epochs $\{0, \ldots, i-1\}$, and the last
time it was written is some $w$ that falls in epoch $i$. For fixed $w < r$, let us analyze the probability that this
happens, over the random choice of the query's position. There are two necessary and sufficient conditions
for the event to happen:

• the boundary between epochs $i$ and $i+1$ must occur before $w$, so the query must appear before
$w + \sum_{j \le i} \lambda^j < w + 2\lambda^i$. Since the query must also appear after $r$, the event is impossible unless
$r - w < 2\lambda^i$.

• the boundary between epochs $i$ and $i-1$ must occur in $(w, r)$, so there are at most $r - w$ favorable
choices for the query. Note also that the query must occur before $r + \sum_{j<i} \lambda^j$, so there are at most
$2\lambda^{i-1}$ favorable choices.

Let $i^*$ be the smallest value such that $r - w < 2\lambda^{i^*}$. Let us analyze the contribution of this operation to
$E[|W_i \cap R_{<i}|/\lambda^i]$ for various $i$. If $i < i^*$, the contribution is zero. For $i = i^*$, we use the bound $2\lambda^{i-1}$ for
the number of favorable choices. Thus, the contribution is $O(\frac{\lambda^{i-1}}{n} \cdot \lambda^{-i}) = O(\frac{1}{\lambda n})$. For $i > i^*$, we use the
bound $r - w$ for the number of favorable choices. Thus, the contribution is $O(\frac{r-w}{n} \cdot \lambda^{-i}) = O(\frac{1}{n} \lambda^{i^*-i})$.

We conclude that the contribution is a geometric sum dominated by $O(\frac{1}{\lambda n})$. Overall, there are at most
$n \cdot t_u$ memory reads in expectation, so averaging over the choice of $i$, we obtain $\frac{1}{i_{\max} - i_{\min} + 1} \cdot O(\frac{1}{\lambda n}) \cdot n t_u =
O(t_u/(\lambda \log_\lambda n))$. □
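
For reference, the geometric sum can be written out as follows. This is a hedged reconstruction, reading the per-epoch contributions as $O(\frac{1}{\lambda n})$ at $i = i^*$ and $O(\frac{1}{n}\lambda^{i^*-i})$ for $i > i^*$, and assuming $\lambda \ge 2$ so that $\sum_{j \ge 1} \lambda^{-j} = \frac{1}{\lambda - 1} = O(\frac{1}{\lambda})$.

```latex
\sum_{i \ge i^*} (\text{contribution at epoch } i)
  \;=\; O\!\left(\frac{1}{\lambda n}\right)
        + \sum_{i > i^*} O\!\left(\frac{\lambda^{\,i^*-i}}{n}\right)
  \;=\; O\!\left(\frac{1}{\lambda n}\right)
        + O\!\left(\frac{1}{n} \sum_{j \ge 1} \lambda^{-j}\right)
  \;=\; O\!\left(\frac{1}{\lambda n}\right).
```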



Our main claim is: 

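Lemma 10. Let $i \in \{i_{\min}, \ldots, i_{\max}\}$ be chosen as in Lemma 9 and $k = \lambda^i$. Let $p$ be the probability over
$\mathcal{D}_{\mathrm{No}}$ that the query reads a cell from $W_i \setminus W_{<i}$. We can encode a vector of $2k$ bits with $k$ ones by a message
of expected size: $O(\lambda^{i-1} t_u \cdot B \lg \lambda) + O(k t_u \lg \frac{t_q}{t_u}) + 2k \cdot \left[ H_b(\frac{p}{2}) + H_b(3C_\varepsilon) \right]$.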

We first show how the lemma implies the desired lower bound. Let $\lambda$ be such that the first term in the
message size is $\le \lambda^i = k$. This requires setting $\lambda = O(t_u B \lg(t_u B))$.
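
The choice of $\lambda$ amounts to solving $\lambda \ge x \lg \lambda$ for $x$ proportional to $t_u B$. The sketch below finds the smallest such integer by doubling and bisection, so it can be compared against $x \lg x$. The function name is hypothetical, and folding the hidden constant of the first term into `x` is an assumption of the sketch.

```python
import math


def smallest_lambda(x):
    """Smallest integer lam >= x with lam >= x * log2(lam).

    Here x stands for c * t_u * B, with the hidden constant c folded in
    (an assumption of this sketch).
    """
    def ok(lam):
        return lam >= x * math.log2(lam)

    lo = max(2, math.ceil(x))
    if ok(lo):
        return lo
    hi = lo
    while not ok(hi):                # doubling phase: afterwards ok(hi) holds and ok(lo) fails
        lo, hi = hi, 2 * hi
    while hi - lo > 1:               # bisection: the predicate is monotone on [lo, hi]
        mid = (lo + hi) // 2
        if ok(mid):
            hi = mid
        else:
            lo = mid
    return hi


x = 1024
lam = smallest_lambda(x)
print(lam, x * math.log2(x))
```

For this sample value the solution lies between $x \lg x$ and $2x \lg x$, consistent with $\lambda$ being on the order of $t_u B \lg(t_u B)$.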

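Note that the lower bound is the same for $t_u = 1 - \varepsilon$ and, say, $t_u = 1/\sqrt{B}$. Thus, we may assume that
$t_u \le 1/\sqrt{B} \le 1/\sqrt{\lg n}$. Therefore $k t_u \lg \frac{t_q}{t_u} = O\big(k \cdot \frac{\lg\lg n}{\sqrt{\lg n}}\big) = o(k)$.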

Since the entropy is $2k - O(\lg k)$, we must have $\frac{1}{2} + H_b(\frac{p}{2}) + H_b(3C_\varepsilon) \ge 1 - o(1)$. For a small enough
$C_\varepsilon$, we obtain $H_b(\frac{p}{2}) = \Omega(1)$, so $p = \Omega(1)$.

We have shown that a query over $\mathcal{D}_{\mathrm{No}}$ reads a cell from $W_i \setminus W_{<i}$ with constant probability, for at least
a quarter of the choices of $i$. By linearity of expectation $t_q = \Omega(\log_\lambda n) = \Omega(\lg n / \lg(B t_u))$.

Proof of Lemma 10. The encoding will contain:

1. The cache contents at the end of epoch $i$, and right before the query.

2. The address and contents of cells in $W_i \cap R_{<i}$. The decoder can recover $W_{<i}$ by simulating the
updates after epoch $i$.

3. Let $X \subseteq Q \setminus S_i$ be the set of negative queries that read at least one cell from $W_i \setminus W_{<i}$. We encode
$X$ as a subset of $Q$.

4. For each cell $c \in W_i \setminus W_{<i}$, mark some positive query $q \in S_i$ such that $c$ is the first cell from $W_i \setminus W_{<i}$
that $q$ reads. Note that distinct cells can only mark distinct queries, but some cells may not mark any
query if no positive query reads them first among the set. Let $M$ be the set of marked queries, and
encode $M$ as a subset of $Q$.

5. For each $q \in M$, encode the number of cell probes before the first probed cell from $W_i \setminus W_{<i}$, using
$O(\lg t_q)$ bits.

6. The subset of queries from $Q$ that return a wrong answer (Monte Carlo error).

The decoder starts by simulating the queries $M$, and stops at the first cell from $W_i \setminus W_{<i}$ that they read (this
is identified in part 5 of the encoding). The simulation cannot continue because the decoder doesn't know
these cells. Let $W^* \subseteq W_i \setminus W_{<i}$ be the cells where this simulation stops.

The decoder knows that queries from $M$ are positive and queries from $X$ are negative. It will simulate
the other queries from $Q$. If a query tries to read a cell from $W^*$, the simulation is stopped and the query is
declared to be positive. Otherwise, the simulation is run to completion.

We claim this simulation is correct. If a query is negative and it is not in $X$, it cannot read anything from
$W_i \setminus W_{<i}$, so the simulation only uses known cells. If a query is positive but it is not in $M$, it means either:
(1) it doesn't read any cell from $W_i \setminus W_{<i}$ (so the simulation is correct); or (2) the first cell from $W_i \setminus W_{<i}$
that it reads is in $W^*$, because it marked some other query in $M$. By looking at component 6, the decoder
can correct wrong answers from simulated queries.

It remains to analyze the size of the encoding:

1. The cache size is $M = O(\lambda^{i-1})$ bits.

2. By Lemma 9, $E[|W_i \cap R_{<i}|] = O(\lambda^{i-1} \frac{t_u}{\log_\lambda n})$, so this component takes $O(\lambda^{i-1} t_u \lg \lambda \cdot B)$ bits.

3. Since $E[|X|] \le p \cdot k$, this component takes $E[\log_2 \binom{2k}{|X|}] \le 2k \cdot H_b(\frac{p}{2})$ bits.

4. Since $E[|W_i|] \le \lambda^i t_u = k t_u$, this takes $E[\log_2 \binom{2k}{|M|}] \le k \cdot O(t_u \lg \frac{1}{t_u})$ bits.

5. This takes $E[|W_i|] \cdot O(\lg t_q) = O(k t_u \lg t_q)$ bits.

6. We expect $2C_\varepsilon \cdot k$ wrong negative queries and $4C_\varepsilon \cdot k$ wrong positive queries, so this component takes
$2k \cdot H_b(3C_\varepsilon)$ bits on average.

This completes the proof of Lemma 10, and our lower bound.